Reordering Columns for Smaller Indexes 



Daniel Lemire^'*, Owen Kaser'' 

°-LICEF, Universite du Quebec d Montreal (UQAM), 100 Sherbrooke West, Montreal, QC, 

H2X 3P2 Canada 

''Dept. of CSAS, University of New Brunswick, 100 Tucker Park Road. Saint John. NB, 

Canada 



Abstract 

Column-oriented indexes — such as projection or bitmap indexes — are compressed 
by run-length encoding to reduce storage and increase speed. Sorting the tables 
improves compression. On realistic data sets, permuting the columns in the 
right order before sorting can reduce the number of runs by a factor of two or 
more. Unfortunately, determining the best column order is NP-hard. For many 
cases, we prove that the number of runs in table columns is minimized if we 
sort columns by increasing cardinality. Experimentally, sorting based on Hilbert 
space-filling curves is poor at minimizing the number of runs. 
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1. Introduction 

Many database queries have low selectivity. In these instances, column- 
oriented indexes such as projection, bit-sliced pQ [5] and bitmap indexes [3] 
might be preferable [1]. These indexes are often compressed using run-length 
encoding (RLE). Other data organization models benefit from column-oriented 
RLE compression such as the decomposition storage model ^ used by column 
stores [SHU] and the Partition Attributes Across (PAX) [T^ model. 

We wish to order the rows to generate faster and smaller column-oriented 
indexes. Sorting reduces the number of column runs — the sequences of identical 
values within rows |13j . 

Adabi et al. recommend lexicographic sorting with "low cardinality columns 
serv[ing] as the leftmost sort orders" ^l3j. We want to justify this empiri- 
cal recommendation. For uniformly distributed tables, we show that sorting 
lexicographically with the columns in increasing cardinality is asymptotically 
optimal — for large column cardinalities. Furthermore, we show how to extend 
this result to all column cardinalities. As an additional contribution, we bound 
the suboptimality of sorting lexicographically for the problem of minimizing the 
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number of runs. With this analytical bound, we show that for several realis- 
tic tables, sorting is 3-optimal or better as long as the columns are ordered in 
increasing cardinality. 

We present our results in 4 steps: modeling (§[2]), a priori bounds (§[3] and 
§[4]), analysis of synthetic cases (§ [5| and experiments (§[6|. Specifically, the 
paper is organized as follows: 

There are many possible RLE implementations. In § [2] we propose to 
count column runs as a simplified cost model. 



In § [3] we prove that minimizing the number of runs by row reordering is 
NP-hard. 

In § |4j we review several orders used to sort tables in databases : the lex- 
icographical order, the reflected Gray-code order, and so on. We regroup 
many of these orders into a family: the recursive orders. In § 4.1 we 
bound the suboptimality of sorting as an heuristic to minimize the num- 
ber of runs. In § |4.2[ we prove that determining the best column order is 
NP-hard. 

In § [5] we determine analytically the best column order for some synthetic 
cases. Specifically, in § |5.H we analyze tables where all possible tuples 
are present. In § |5.2[ we consider the more difficult problem of uniformly 
distributed tables. We first prove that for high cardinality columns, or- 
ganizing the columns in increasing cardinality is best at minimizing the 



number of runs (see Theorem [2]). In § 5.2.1 and § 5.2.2 we show how to 
extend this result to low cardinality columns for the lexicographical and 
reflected Gray-code orders. 

Finally, we verify experimentally the importance of column ordering in 
§ [6) and assess other factors such as column dependencies. We show that 
an order based on Hilbert space-filling curves [2] is not competitive to 
minimize the number of runs. 



2. Modeling RLE compression by the number of column runs 

RLE compresses long runs of identical values: it replaces any run by the 
number of repetitions followed by the value being repeated. For example, the 
sequence 11111000 becomes 5-1, 3-0. In column-oriented databases, RLE makes 
many queries faster: sum, average, median, percentile, and arithmetic opera- 
tions over several columns p|. 

There are many variations on RLE: 

• Counter values can be stored using fixed- length counters. In this case, any 
run whose length exceeds the capacity of the counter is stored as multiple 
runs. For example, Adabi et al. [T3] use a fixed number of bits for the 
tuple's value, start position, and run length. We can also use variable- 
length counters [T5ti22] or quantized codes |23j . 
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• When values are represented using fewer bits than the counter values, we 
may add the following convention: a counter is only present after the same 
value is repeated twice. 

• In the same spirit, we may use a single bit to indicate whether a counter 
follows the current value. This is convenient if we are transmitting 7-bit 
ASCII characters using 8-bit words ^il [^ . 

• It might be inefficient to store short runs using value-counter pair. Hence, 
we may leave short runs uncompressed (BBC 26J, WAH 27 or EWAH 28j). 

• Both the values and the counters have some statistical distributions. If 
we know these distributions, more efficient encodings are possible by com- 
bining statistical compression with RLE — such as Golomb coding 
Lempel-Ziv, Huffman, or arithmetic encoding. Moreover, if we expect the 
values to appear in some specific order, we can store a delta instead of 
the value |30'. For example, the list of values 00011122, can be coded 
as the (diffed-values, counter) pairs (1,3)(1,3), (1,2). This can be used to 
enhance compression further. 

• To support binary search within an RLE array, we may store not only the 
value and the repetition count, but also the location of the run [M1[5T1I5^ . 
or we may use a B-tree |33j . 

• Instead of compressing the values themselves, we may compress their bits. 
In bitmap indexes, for any given column, several bitmaps can be individ- 
ually compressed by RLE. 

It would be futile to attempt to analyze mathematically all possible appli- 
cations of RLE to database indexes. Instead, we count runs of identical values. 
That is, if is the number of runs in column i and there are c columns, we 
compute ^^^^ (henceforth RunCount). 

3. Minimizing the number of runs by row reordering is NP-hard 

We want to minimize RunCount by row reordering. Consider a related 
problem over Boolean matrices |34| : minimizing the number of runs of ones in 
rows by column reordering. This "Consecutive Block Minimization" problem 
(CBMP) is NP-hard ^ SR17],j36j5 Yet, even if we transpose the matrix, 
CBMP is not equivalent to the RunCount minimization problem. Indeed, 
both sequences 001100 and 000011 have a single run of ones. Yet, the sequence 
001100 has three runs whereas the second sequence (000011) has only two runs. 
Moreover, the RunCount minimization problem is not limited to binary data. 
To our knowledge, there is no published proof that minimizing RunCount by 
row reordering is NP-hard. Hence, we provide the following result. 



table of notation can bo found in Appendix p\l 
Another NP-hardness proof was later given byTFinar and Heath "ST". 
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(a) lexicographic order (RunCount is 16) (b) Best row order (RunCount is 14) 

Figure 1: A two-column table with different row orders and corresponding total 
numbers of runs (RunCount) 



Lemma 1. Minimizing RunCount by row reordering is NP-hard. 

Proof. We prove the result by reduction from the Hamiltonian path prob- 
lem, which remains NP-hard even if a starting and ending vertex are speci- 
fied [GT39][3S]. Consider any connected graph G having n vertices and m edges, 
and let s and t be respectively the beginning and end of the required Hamilto- 
nian path. 

Consider the incidence matrix of such a graph. There is a row for each 
vertex, and a column for each edge. The value of the matrix is one if the edge 
connects with the vertex, and zero otherwise. Each column has only two ones; 
thus it has either 

1. two runs (if the ones are consecutive, and either at the top or bottom of 
the column) 

2. three runs (if the ones are consecutive but not at the top or bottom of the 
column, or if there are ones at the top and bottom) 

3. four runs (if the ones are not consecutive, but a one is at the top or at the 
bottom), or 

4. five runs (in all other cases). 

Thus, the number of column runs in this incidence matrix is less than 5m. 

We modify the incidence matrix by adding 10m new columns. These columns 
contain only zeros, except that 5m columns have the value one on the row 
corresponding to vertex s, and 5m other columns have the value one on the row 
corresponding to vertex t (see Fig. [2]). These new columns have either 2 runs or 
3 runs depending on whether the rows corresponding to s and t are first, last or 
neither. 

Suppose that the row corresponding to s is not first or last. Then the number 
of runs in the newly added 10m columns is at least 3 x 5m -I- 2 x 5m = 25m (or 
30m if both s and t are neither first nor last). Meanwhile, the number of runs in 
the original incidence matrix is less than 5m. Thus, any row order minimizing 
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, columns 



5m columns 5m columns 



11... 11 

00 ... 00 



00 ... 00 
00 ... 00 



incidence matrix 00 ... 00 00 ... 00 



n rows 



00 ... 00 00 ... 00 
00... 00 11... 11 



Figure 2: Matrix described in the proof of Lemma [T] 



the number of runs will have the rows corresponding to s and t first and last. 
Without loss of generality, we assume s is first. 

A minimum-run solution is obtained from a Hamiltonian path from s io t 
by putting rows into the order they appear along the path. Such a solution has 
two columns with two runs, n — 3 columns with three runs (or n — 2 columns 
with three runs, if (s, t) were an edge of G), and the columns for the other edges 
in G each have five runs. Finally, the lOm added columns have two runs each. 
Yet, having so few runs implies that an s-t Hamiltonian path exists. Hence, we 
have reduced the s-t Hamiltonian path problem to minimizing the RunCount 
by row reordering. □ 

4. Lexicographic and Gray-code sorting 

While the row reordering problem is NP-hard, sorting is an effective heuristic 
to enhance column-oriented indexes [11128]. Yet, there are many ways to sort 
rows. 

A total order over a set is such that it is transitive {a <b and b < c implies 
a < c), antisymmetric {a < b and b < a implies a = b) and total {a < b or b < a). 
A list of tuples is discriminating |38j if all duplicates are listed consecutively. 
Orders are discriminating. 

We consider sorting functions over tuples. We say that an order over c-tuples 
generates an order over c— 1-tuples if and only if the projection of all sorted lists 
of c-tuples on the first c — 1 components is discriminating. When this property 
applies recursively, we say that we have a recursive order: 

Definition 1. A recursive order over c-tuples is such that it generates a recur- 
sive order over c— 1-tuples. All orders over 1-tuples are recursive. 

An example of an order that is not recursive is (1,0,0), (0,1,1), (1,0,1), since 
its projection on the first two components is not discriminating: (1,0), (0,1), 
(1,0). We consider several recursive orders, including lexicographic order and 
two Gray-code orders. 
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Figure 3: A table sorted in a (reflected) Gray-code order. Except for the first 
row, there is exactly one new run initiated in each of the iVi ^ rows (in bold). 
Thus, the table has c — 1 + Ni^c = 3— 1 + 3x2x2 = 14 column runs. 



Lexicographic order. The lexicographic order is also commonly known as the 
dictionary order. When comparing two tuples a and b, we use the first compo- 
nent where they differ {aj ^ bj but Oi = bi for i < j) to decide which tuple is 



smaller (see Fig. 4a) 



Let Ni be the cardinality of column i and n be the number of rows. Given 
all possible Ni ^ = Y[i=i tuples, we have Ni ^, runs in the last column, 
Ni c-i runs in the second last column and so on. Hence, we have a total of 
jyj=i -^i-i runs. If the Ni's have the same value Ni — N for all i's, then we 
have + iV^-i + --- + N = ^^^^ - 1 runs. 

Gray- code orders. We are also interested in the more efficient Gray-code orders. 
A Gray code is a list of tuples such that the Hamming distance — alternatively 
the Lee metric [321 — between successive tuples is one. Knuth [301 PP- 18-20] 
describes two types of decimal Gray codes. 

• Reflected Gray decimal ordering is such that each digit goes from to 9, 
up and down alternatively: 000, 001, . . . , 009, 019, 018, . . . , 017, 018, 028, 
029, . . . , 099, 090, . . . 

• Modular Gray decimal is such that digits always increase from 1 modulo 
10: 000, 001, . . . , 009, 019, 010,. . . , 017, 018, 028, 029, 020, . . . 

The extension to the mixed-radix case [5^BT] from the decimal codes is straight- 



forward [42] (see Figs. 4b and 4c) 



Because the Hamming distance between successive codes is one, if all possible 
Ni^c tuples are represented, there are exactly c — 1 + Ni^c runs, li Ni = N for 
all i, then we have c — 1 -I- N'^ runs (see Fig. [3]). All recursive Gray-code orders 
have A^i runs in the first column, N1N2 — A^i + 1 runs in the second column, 
and the number of runs in column j is given by 

r, = l + (iV,-l)7Vi^,_i. (1) 
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(Being a recursive order, the values from the first j — I columns form iVi j_i 
blocks, where rows in each block agree on their first j — I components. Being a 
Gray-code order, at any transition from one block to the next, values in column 
j must match.) If we assume iV^ > 1 for all i G {1, . . . , c}, then later columns 
always have more runs. 

From a software-implementation point of view, the lexicographic order is 
more convenient than the reflected and modular Gray codes. A common ap- 
proach to sorting large files in external memory is divide the file into smaller 
files, sort them, and then merge the result. Yet, with these Gray codes, it is 
not possible to sort the smaller files independently: a complete pass through 
the entire data set may be required before sorting. Indeed, consider the these 
two lists sorted in reflected Gray-code order: 

• Anna Awkland, Anna Bibeau, Greg Bibeau, Greg Awkland; 

• Bob Awkland, Bob Bibeau. 

Because we sorted the first list without knowing about the first name "Bob" a 
simple merging algorithm fails. For this reason, it may be faster to sort data by 
the lexicographic order. 

Similarly, while a binary search through a lexicographically sorted list only 
requires comparing individual values (such as Bob and Anna), binary searches 
through a reflected or modular Gray-code ordered list may require the complete 
list of values in each column. 

Non-recursive orders. There are balanced and nearly balanced Gray codes [401 
231131]. Unhke the other types of Gray codes, the number of runs in all columns 
is nearly the same when sorting all possible tuples for A^i — N2 = ... = Nc- 
However, they cannot be recursive. 

Some authors have used Hilbert space-filling curves to order data points [HI 



1351135] (see Fig. 4d|. This order is not recursive. Indeed, the following 2-tuples 
are sorted in Hilbert order: (1,1), (2,1), (2, 2), (1,2). Yet, their projection on 
the first component is not discriminating: 1, 2, 2, 1. It is a balanced Gray 
code when all column cardinalities are the same power of two [M]. Beyond 
two dimensions, there are many possible orders based on Hilbert curves [47] . 
There are also many other alternatives such as Sierpiiiski-Knopp order, Peano 
order [35], the Gray-coded curve [3S], Z-order [5D] and H- index ^51]. They are 
often selected for their locality properties [52] , 

If not balanced, non-recursive orders can be column- oblivious if the number 
of runs per column is independent of the order of the columns. As a trivial 
example, if you reorder the columns before sorting the table lexicographically, 
then the initial order of the columns is irrelevant. 



4-.1- Significance of column order 

Recursive orders depend on the column order. For lexicographic or reflected 
Gray-code orders, permuting the columns generates a new row ordering. The 
next proposition shows that the effect of the column ordering grows linearly 
with the number of columns. 
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(a) Lexicographic 



(b) Reflected Gray-code 





(c) Modular Gray-code (d) Compact Hilbert Index 

Figure 4: Various orderings of the points in a two-dimensional array 



Proposition 1. For tables with c columns, the number of column runs after 
the application of any recursive- order function can vary by a factor arbitrarily 
close to c under the permutation of the columns. 

Proof. The proof is by construction. Given a recursive-order function, we find 
a c-column table that has many runs when processed by that function. However, 
swapping any column with the first yields a table that — recursively sorted in 
any way — has few runs. 

Consider a column made of n distinct values, given in sorted order: A, B, C, 
D, . . . This column is the first column of a c-column table. For every odd row, 
fill all remaining columns with the value 0, and every even row with the value 
1: 



A 

B 1 

C 

D 1 



This table has nc runs and is already sorted. But putting any other column 
first, any recursive order reduces the number of runs to n-f 2(c— 1). For n large, 
^^2{c-i) ~^ ^ which proves the result. □ 
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Figure 5: A table such that no recursive ordering is optimal. 

The construction in the proof uses a high cardinality column. However, we 
could replace this single high cardinality column by [logjy n\ columns having a 
cardinality of at most N, and the result would still hold. 

Hence, recursive orders can generate almost c times more runs than an opti- 
mal order. Yet, no row-reordering heuristic can generate more than c times the 
number of column runs than the optimal ordering solution: there are at least 
n column runs given n distinct rows, and no more than cn column runs in total. 
Hence — as row-reordering heuristics — recursive orders have no useful worst-case 
guarantee over arbitrary tables. We shall show that the situation differs when 
column reordering is permitted. 

Suppose we consider a sorting algorithm that first applies a known reorder- 
ing to columns, then applies some recursive-order function. The proposition's 
bound still applies, because we can make an obvious modification to the con- 
struction, placing the non-binary column in a possibly different position. The 
next refinement might be to consider a sorting algorithm that — for a given 
table — tries out several different column orders. For the construction we have 
used in the proof, it always finds an optimal ordering. 

Unfortunately, even allowing the enumeration of all possible column reorder- 
ings is insufficient to make recursive ordering optimal. Indeed, consider the table 
in Fig. [5] The Hamming distance between any two consecutive tuples is one. 
Thus each new row initiates exactly one new run, except for the first row. Yet, 
because all tuples are distinct, this is a minimum: a Hamming distance of zero 
is impossible. Thus, this row ordering has a minimal number of column runs. 
We prove that no recursive ordering can be similarly optimal. 

We begin by analyzing the neighbors of a tuple, where two tuples are neigh- 
bors if they have a Hamming distance of one: 

• The tuple (K,Y) has only one neighbor: (A,Y). 

• The tuple (H,J) has only one neighbor: (H,C). 
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• The tuples of the form (Z,-) only have neighbors of the form (A,-). 

• The tuples of the form (•, E) only have neighbors of the form (-,0). 

In effect, we must consider all Hamiltonian paths in the graph of neighbors. 
The ordered list must begin and end with (K,Y) and (H,J), if it is optimal. 
A recursive order must be discriminating on the first column. Without loss of 
generality, suppose that the list begins by (K,Y). Thus, all tuples of the form 
(A,-) must follow by recursivity. Then tuples of the form (Z,-) must follow. At 
this point, we cannot continue the list by jumping from neighbor to neighbor. 
Hence, no recursive ordering is optimal. A similar argument shows that flipping 
the two columns leads to the same result: no recursive ordering can be optimal. 

Lemma 2. There are tables where no recursive order minimizes the number of 
runs — even after reordering the columns. 

Determining a tight bound on the suboptimality of recursive ordering re- 
mains open. Recursive orders applied to the example of Fig. [5] generate at least 
15 runs whereas 14 runs is possible, for a suboptimality ratio of y|. If we al- 
low arbitrarily long two-dimensional tables, we can generalize our construction 

to obtain ratios arbitrarily close to i|. Thus, the suboptimality ratio of re- 

1 

cursive orders ranges between j| and c. However, a computer search through 
100,000 uniformly distributed tridimensional tables with 10 rows and six distinct 
column values failed to produce a single case where recursive ordering is sub- 
optimal. That is, among the row orderings minimizing the number of runs, at 
least one is recursive after some reordering of the columns. Hence, it is possible 
that recursive ordering is rarely suboptimal. 

The next proposition gives a simple suboptimality bound on any recursive 
order. This result implies that recursive ordering is 3-optimal or better for 
several realistic tables (see Table [2|. 

Proposition 2. Consider a table with n distinct rows and column cardinali- 
ties Ni for i = l,...,c. Recursive ordering is fi- optimal for the problem of 
minimizing the runs where 

ELimin(n,iVij) 

M = — • 

n + c — 1 

The bound can be made stronger if the recursive order is a Gray code: 

^;^imin(n,l + (A^, -l)iVi,,_i) 

MGC = — z. ■ 

n + c — 1 

Proof. Given a table in any recursive order, the number of runs in the 
i^^ column is bounded by Ni,i and by n. Thus the number of runs in the 
table is no more than min(A'^i,n) + min(iVi_2, ?^) + • • • -|- min(A^i.c, rt). Yet, 
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there are at least n + c — 1 runs in the optimally-ordered table. Hence, the 
result follows. The tighter bound for Gray-code orders follows similarly, and 
the relationship between /ice and ^ follows by straightforward algebra (/j.gc > 
min(/3n, ^N,,) where /3 = ° 
As an example, consider the list of all dates (month, day, year) for a cen- 
tury (A^i = 12, N2 = 31, A^a = 100, n = 12 x 31 x 100): then /i w 1.01 so that 
lexicographic sorting is within 1% of minimizing the number of runs. The op- 
timality bound given by Proposition [2] is tighter when the columns arc ordered 
in non-decreasing cardinality (iVi < N2 < ■ ■ ■ < Nc) . This fact alone can be an 
argument for ordering the columns in increasing cardinality. 

4-2. Determining the optimal column order is NP-hard 

For lexicographic sorting, it is NP-hard to determine which column ordering 
will result in least cost under the RunCount model, even when the tables have 
only two values. We consider the following decision problem: 

Column-Ordering-for-Lex-Runcount ( COLR). Given table T with binary values 
and given integer K, is there a column ordering such that the lexicographically 
sorted T has at most K runs? 

Theorem 1. COLR is NP-complete. 

Proof. Clearly the problem is in NP. Its NP-hardness is shown by reduction 
from the variant of Hamiltonian Path where the starting vertex is given [351 
GT39]. Given an instance {V, E) of Hamiltonian Path, without loss of generality 
let vi he the specified starting vertex. We construct a table T as follows: 
first, start with the incidence matrix. Let V — {wi, W2, . . . , and E = 

{ei, 62, Cm}. Recall that this matrix has a column for each edge and 
a row for each vertex; a^j = 1 if edge ej has vertex Vi as an endpoint and 
otherwise atj — 0. Vertex vi corresponds to the first row. We prepend and 
append a row of zeros to the incidence matrix. Next we prepend h columns 
with values lO'^'"'"^ (i.e., 100. . .0) and h columns with llO'^'; see Fig. [ojfor an 
example. The value of h is "large" ; we compute the exact value later. 

We show the resulting instance, with table T and bound K = Ah^ 3(|1^| — 
1) -|- 5(m — \V\ + 1)), satisfies the requirements for COLR if and only if (V, E) 
contains a Hamiltonian path starting at vi. 

First, suppose that we have a suitable Hamiltonian path in {V,E). Let 
Ci € E he the i*'' edge along this path. Edge ei is incident upon vi. 

Reorder the columns of T: leave the first 2h columns in their current order. 
Next, place the columns corresponding to ti in order ei, 62, . . . , e|y|_i. The 
remaining columns follow in an arbitrary order. See Fig.[7j where it is apparent 
that the constructed table is already lexicographically sortecj^ Also, the first 2h 
columns have 2 runs each, and the |y| — 1 columns for have three runs each 



We sort with 1 ordered before 0. 
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(b) Constructed table, incidence matrix shaded. 

Figure 6: Table built from graph on the left. There are h copies of the column 
that begins 10 . . . and the column that begins with 11 . . .. 
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Figure 7: A lexicographically sorted table with the required RunCost bound 
is obtained from the Hamiltonian path consisting of edges ei, 64, 62, 66- 



( each has the value OniOl^l"'). The remaining m — |F| + 1 columns have five 
runs each: all patterns with adjacent ones have been used (and there are no 
duplicates); hence, all remaining patterns are of the form 0+10+10"'". Thus the 
bound is met. 

Next, suppose T satisfies the requirements of COLR with the given bound 
K = 4h + 3(V| - 1) + 5(m - \V\ + 1). We show this implies {V,E) has a 
Hamiltonian path starting with vi . 

If h is large enough, we can guarantee that the first two rows have not 
changed their initial order. This is enforced by the 2h columns that were initially 
placed leftmost. Their column values must end with (the row of zeros is 
always last after lexicographic sorting). If we analyze the RunCount cost of 
these columns, they cost Ah when the first two rows remain in their initial order, 
otherwise they cost 5h or 6h. If h is large enough, this penalty will outweigh 
any possible gain from having a column order that, when sorted, moves the first 
two rows. 

Knowing the first row, we deduce that every column begins with a one if it 
is one of the 2h columns, but it begins with a zero in every remaining column. 
We now focus on these remaining columns, which correspond to edges in E. 
Since each column value begins and ends with zero and has exactly two ones. 
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Table 1: Total number of runs for complete tables. 



ordering 



number of runs (RunCount) 



Gray-code 
lexicographic 



c-l + A^i,c 



its pattern is either 0+110+ (3 runs) or 0+10+10+ (5 runs). The specified 
RunCount bound implies that we must have — 1 columns with 3 runs. The 
edges for these columns form the desired Hamiltonian path that starts at vi. 

To finish, we must choose h such that the penalty (for choosing a column 
ordering that disrupts the order of the first two rows after lexicographic sorting) 
exceeds any possible gain. The increased cost from 4/i is at least 5/i, a penalty 
of at least h. An upper bound on the gain from the other columns is 3m because 
the RunCount is no more than 5m and cannot be decreased below 2m. Choose 
/i = 3m + l. □ 

This result can be extended to the reflected Gray-code order and, we con- 
jecture, to all recursive orders. 

A related problem tries to minimize the maximum number of runs in any 
table column. This problem is also NP-hard (see Appendix [d| . 

5. Increasing-cardinality-order minimizes runs 

Consider a sorted table. The table might be sorted in lexicographic order 
or in reflected Gray-code order. Can we prove that sorting the columns in 
increasing cardinality is a sensible heuristic to minimize the number of runs? 
We consider analytically two cases: (1) complete tables and (2) uniformly dis- 
tributed tables. 

5.1. Complete tables 

Consider a c-column table with column cardinalities Ni, N2, . . . , N^.. A com- 
plete table is one where all Ni^c possible tuples are present. In practice, even 
if a table is not complete, the projection on the first few columns might be 
complete. 

Using a lexicographic order, a complete table has X]j=i^i,i runs, hence 
the RunCount is minimized when the columns are ordered in non-decreasing 
cardinality: Ni < Ni+i for i = 1, . . . , c — 1. Using Gray-code ordering, a 
complete table has only c — 1 + Ni^c runs (the minimum possible) no matter 
how the columns are ordered. Hence, for Gray-code, the RunCount of complete 
tables is not sensitive to the column order. 

Somewhat artificially, we can create a family of recursive orders for which 
the RunCount is not minimized over complete tables when the columns are 
ordered in increasing cardinality. Consider the following family: "when Ni is 
odd, use reflected Gray code order. Otherwise, use lexicographic order." For 
A^i = 2 and N2 ~ 3, we have 8 runs using lexicographic order. With A^i = 3 
and N2 = 2, we have 7 runs using any recursive Gray-code order. Hence, we 
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N.TflQ.Q. 



23456789 10 
Number of columns (r ) 

Figure 8: Relative benefits of Gray-code sorting against lexicograplric orders for 
complete c-column table where all column cardinalities are N 



cannot extend our analysis to all families of recursive orders from Gray-code and 
lexicographic orders. Nevertheless, if we assume that all column cardinalities 
are large, then the number of runs tends to -/Vi^c and all column orders become 
equivalent. 

The benefits of Gray-code orders — all Gray-code orders, not just recursive 
Gray-code orders — over lexicographic orders are small for complete tables having 
high cardinalities as the next proposition shows (see Fig. [s]) . 

Proposition 3. Consider the number of runs in complete tables with columns 
having cardinality N . The relative benefit of Gray- code orders over lexicographic 
orders grows monotonically with c and is at most 1/N. 

Proof. The relative benefits of Gray-code sorting for complete tables with all 

columns having cardinality N is — " ^,^c+i_i ■ As c grows, this quantity 

converges to from below. □ 

5.2. Uniformly distributed case 

We consider tables with column cardinalities Ni, N2, . . . Nc- Each of the 
Ni^c possible tuples is present with probability p. Whenp = 1, we have complete 
tables. 

For recursive orders over uniformly distributed tables, knowing how to com- 
pute the expected number of runs in the second column of a two-column table 
is almost sufficient to analyze the general case. Indeed, given a 3-column table, 
the second column behaves just like the second column in 2-column table with 
p 1 — (1 — p)^^ . Similarly, the third column behaves just like the second 
column in a 2-column table with Ni •(— N1N2 and N2 <— N3. 
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This second column is divided into TVi blocks of N2 tuples, each tuple having 
a probability p of being present. The expected number of tuples present in the 
table is N1N2P. However, N1N2P is an overestimate of the number of runs in 
the second column. Wc need to subtract the expected number of seamless joins 
between blocks: two blocks have a seamless join if the first one terminates with 
the first value of the second block. The expected number of seamless joins is 
no larger than the expected number of non-empty blocks minus one: Nip^.^ — 1 
where pjvj = 1 — (1 — p)^^- While for complete tables, all recursive Gray-code 
orders agree on the number of runs and seamless joins per column, the same is 
not true for uniformly distributed tables. Different recursive Gray-code orders 
have different expected numbers of seamless joins. 

Nevertheless, we wish to prove a generic result for tables having large column 
cardinalities {Ni ^ 1 for all z's). Consider a two-column table having imiform 
column cardinality N. For any recursive order, the expected number of seamless 
joins is less than Np^. However, the expected sum of the number of runs and 
seamless joins is Np^ in the first column and N'^p in the second, for a total of 
NpN + N^p. For a fixed table density, the ratio N p^ / {N pM + N^p) goes to 
zero as 1/N since pjv — 1 exponentially. Hence, for tables having large column 
cardinalities, the expected number of seamless joins is negligible compared to 
the expected number of runs. The following lemma makes this result precise. 

Lemma 3. Let Si and Ri be the expected number of seamless joins and runs in 
column i. For all recursive orders, we have 

Ei=i St + J2i=i minfcg{i_2,...,c} Nk 

over uniformly distributed tables. 

Proof. Column i -\- \ has an expected total of runs and seamless joins of 
Si+i + Ri+i = iVi^j+i/!jjV;_^2..-Afc- has less than Ni^iPNi^i...Nc seamless joins. 
We can verify that PNi+i...Nc ^ pNi+2—Nc ^ov all p G [0, 1]. Thus 6'i+i/(S'i+i + 
Ri+i) < 1/Ni+i. 

Hence, we have NiSi < Si + Ri. This implies that min;;g|i^2,...,c} ^kSi < 
Si + Ri. Therefore, we have minfeg{i^2,...,c} -^fe Z^i=i < X^^^i Si + YTi^i 
which proves the result. □ 

Therefore, for large column cardinalities, we can either consider the number 
of runs, or the smn of the runs and seamless joins. In this context, the next 
proposition shows that it is best to order columns in increasing cardinality. 

Proposition 4. The expected sum of runs and seamless joins is the same for 
all recursive orders. Moreover, it is minimized over uniformly distributed tables 
if the columns are sorted by increasing cardinality. 

Proof. For all recursive orders, the expected number of runs and seamless joins 
for columns i and i + 1 is Ni^ip^i^^.-Nc + Ni,i+iPNi+2---Nc- The second term — 
corresponding to column i + 1 — is invariant under a permutation of columns 
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i and i + 1. We focus our attention on the first term: Ni^ip]\j.^^...N^. After 
permuting i and « + 1, it becomes Ni^i-iNi^ipjs[^Ni+2---N^- 

To simplify the notation, rewrite pNi+i---N^ and pNiNi+2---N^ as pNi+i and 
Pn- by substituting PNi+2---N^ for p and let i = 1. Thus, we compare Nipp^^ and 

To prove the result, it is enough to show that Nip^^ < N2PN1 implies 
A^i < N2 for p £ (0,1]. Suppose that it is not the case: it is possible to 
have N1PN2 < N2PN1 and A^i > N2- Pick such Ni,N2. Let x = 1 — _p, then 
N1PM2 — N2PN1 is Ni{l — x^^) — N2{1 — x^^). The polynomial is positive 
for X = since A^i > A^2- Because Nip^^^ < N2PN1 is possible (for some 
value of x), the polynomial must be negative at some point in (0, 1), hence it 
must have a root in (0, 1). However, the polynomial has only 3 terms so that 
it cannot have more than 2 positive roots (e.g., by Descartes' rule of signs). 
Yet, it has a root of multiplicity two at a; ~ 1: after dividing by x — 1, we 
get Ni{l + X + ■ ■ ■ + x^^^^) — iV2(l + a; + • • • + x^'^^^) which is again zero at 
X = 1. Thus, it has no such root and, by contradiction, Nip^^ < N2PN1 implies 
A^i < N2 for p e (0, 1]. The proof is concluded. □ 

Theorem 2. Given 

1. the expected number of runs in a table sorted using any recursive order 
with an ordering of the column in increasing cardinality and 

2. /^optimal ^ smallest possible expected number of runs out of all possible 
recursive orders on the table (with the columns ordered in any way), 



then 



^optimal 

< 



R'^ minfcg{i^2,...,c} Nk 

over uniformly distributed tables. That is, for large column cardinalities — 
minfcg{i_2,...,c} -^fc is large — sorting a table recursively with the columns ordered 
in increasing cardinality is asymptotically optimal. 

Proof. Whenever a> b, then 1 — a < 1 — 6. Applying this idea to the statement 
of Lemma [3j we have 



or 



Y.i=i + Y.i=i minfeg{i^2,...,c} Nk 

min/ce{i,2,...,c} Nk - 1 



> 

mmfeg{i^2,...,c} Nk 
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Let 5^ and 5'°p*™'^' be the expected number of seamless joins corresponding to 
and i?°pt™ai^ Yiave 



^optimal \, ^^^A:G{1,2,....c} 1 /^optimal | ^optimalj 

niinfcg{i^2,...,c} Nk 
> [R^ + by Prop, i 

^ ininfcg{i^2....,c} ~ ^ j^t 
minfeg{i^2,...,c} ^^/c 

from which the result follows. □ 

From this theorem, we can conclude that — over uniformly distributed tables 
having large column cardinalities — sorting lexicographically with the column 
ordered in increasing cardinality is as good as any other recursive sorting. 

The expected benefits of seamless joins are small, at least for uniformly 
distributed tables. Yet, they cause runs from different columns to partially 
overlap. Such partial overlaps might prevent some computational optimizations. 
For this reason, Bruno [53] avoids seamless joins in RLE-compressed columns: 
each seamless join becomes the start of a new run. In this model, Proposition [4] 
already shows that ordering the columns in increasing cardinality minimizes the 
expected number of runs — for uniformly distributed tables. 

5.2.1. Best column order for lexicographic sorting 

While Theorem [2] states that the best column ordering — for all recursive 
orders — is by increasing cardinality, the result is only valid asymptotically (for 
large column cardinalities). For the lexicographic order, we want to prove that 
the best column ordering is by increasing cardinality, irrespective of the column 
cardinalities. 

The A'^i blocks in the second column of a lexicographically ordered are or- 
dered from 1 to N2. Let P11N2 be the probability that any two non-empty such 
blocks have a seamless join. The probability that the first x tuples in a block 
are not present whereas the x + l"^ tuple is present is (1 —vfvl (1 ~ (1 ~V)^'^^- 
To obtain a seamless join, we need a run of precisely A'2 — 1 missing tuples, 
and it can begin anywhere between the second possible tuple in the first block 



and the first possible tuple in the second block. (See Fig. 9a ) Hence, we have 



-PiiW2 = (i-(i-p)iV2)2 = (i-p)p^^ ■ Let P}XN2,v' and pAr^.p' be P^^n^ and pat^ 
with p' substituted in place of p. 

To prove that ordering the columns by increasing cardinality minimizes the 
number of runs, it is enough to prove that permuting the columns two-by-two, 
so as to put the column with lesser cardinality first, never increases the number 
of runs. To prove this result, we need the following technical lemma. 

Lemma 4. For 1 < Af2 < ^3 < 30 and < p < 1, we have 

(1 - P\XNz)PN^^1 - P\iN2..PN.PN2,PN., < (1 - P[iN2)PN2N:i - P[iN3.PN2PN3..PN2 ' 
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[1 2---fc— jV2][l^^fc---iV2] 

N2-k 
(a) Lexicographic 

[1 2---k---N2^\N2 iVz - l--- fc---l] 

ATa-fe Af2-fc 
(b) Reflected GC (successive blocks) 

[1 2 ■ ■ ■ fc ■ ■ ■N2]\ N2 1^2 ■ ■ ■ fc ■ ■ ■ iVa ~ 1] 

^2-*: fc mod N.2 
(c) Modular GC (successive blocks) 

[1 2---fc-_JV2][ s s + fc---s-l] 

Ar2-fc (fc-l+y) mod ATj 
(d) Modular GC (separated by i^ — 1 blocks) 

Figure 9: Two consecutive non-empty blocks and the number of missing tuples 
needed to form a seamless join. The last figure shows the pattern where y — 1 
empty blocks separate the two non-empty blocks, and the count sequence in the 
second block starts at s = 1 + (— y mod iV2). 



Proof. Observe that 1 — = (1— p)^^ and pn2,pn^ = 1 — (1— p)^^^-' = Pn^n^- 
To prove the result, we show that: 

• For p sufficiently close to 1, the result holds. 

• We can turn the inequality into a polynomial in p with no root in (0, 1). 

The first item is easy: taking the limit as p — > 1 on both sides of the 
inequality, we get N2 < N3. To conclude the proof, we have to show that 

(1 - P.llN3)PN3N2 - PllN2,PN3PN2,PN3 ~ (1 - ^lLA^2)PAf2^3 + PllN3,PN2PN3,PN2 

never zero for p G (0, 1). We multiply this quantity by ppf^N^. We proceed to 
show that the result is a polynomial. 

Since 1 — = (1 — z){l + z + ■ ■ ■ z^"^), we have that the polynomial 
PN2N3 is divisible by both p^^ and p^.^ by respectively setting z = (1 — p)^^ 
and z ^ (1 -_p)^-'. Hence, (1 - Piin3)pn3PN2N3 and (1 - Piin2)pn2PN2N3 are 
polynomials. 

We also have that Piin2.pn3PN2,pn3 ~ ^''^"^pw w"^^ ^^'^ similarly for 

PliN3,PN2PN3,PN2 ^^^^ {P[iN2,PN3PN2,PN3 ^ P\iN3 ,p N2P ^3 -.P N2) P N2N3 IS & POly- 

nomial. 

Hence, for any given N2 and iVs, we can check that the result holds by 
applying Sturm's method (54j to the polynomial over the interval (0, 1]. Because 
there is no root at p = 1, we have to check that the total root count over (0, 1] 
is always zero. We proved this result using a computer algebra system (see 
Appendix |b| for values of N2 and iVa up to 30. This concludes the proof. □ 

There are A^i — 1 pairs of blocks immediately adjacent, A^i — 2 pairs of blocks 
separated by a single block, and so on. Hence, the expected number of seamless 
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joins in the second column isj^S'j^^'^^ = P1IN2PI12 Ef=o ^(^1 ~ 1 " " PN2)'' 
or = P^i^nApn.N, + (1 - patJ^i - 1) = PiiN^PN^Ni + e for |e| < 1. 

Proposition 5. Consider a table with c independent and uniformly distributed 
columns having cardinalities Ni,N2,...,Nc (let 2 < Ni < iV^+i < 30 for 
i = 1, . . . ,c — 1). We can sort the table by lexicographic order according to 
various column orders. The column order Ni, N2, ■ ■ ■ , Nc minimizes the number 
of column runs — up to a term no larger than c in absolute value. 

Proof. Define Tj^^^" = N.N^Pn, - Pun, ,pn3PN2,pn2^i *s the number of 
expected number of runs — up to a constant term no larger than one in absolute 
value — in the second column of a 3-column table with cardinalities Ni,N2, -/V3 
and uniform distribution. Define Ti^X^3,p, T]^^%°,,p^^ and T^^^X^v^.p sim- 
ilarly. It is sufiicient to prove that T]^:%1,,^ +T'^^ilj,,,, < T'^^nIp^^ + 
'^NiN°,N2,p whenever N2 < N3, irrespective of the value of Ni (allowing Ni > 
N3). We have 

= NiN2PN:i - PllN2,PN3PN2,PN3^1 

+N1N2N3P - PiiNsPNsNiN2 
= i'^-PllN,)pN,N,N2 

-P[iN2.PNsPN2..PN3^l 

+N1N2N3P 
< {l~PllN2)PN2NiN3 

-PU-Ns .p«2 PNs ,PN2 ^ 1 

+N1N2N3P (by Lemma|4| 

= N1N3PN2 - P.llN3.PN2PN3,PN2^i 

+NIN2N3P - PIIN2PN2N1N3 

rplcxico I T-ilcxico 

— ^Ni,N3,pN2 ~^ ^NiN3,N2,p- 

We conjecture that a similar result would hold for all values of Ni larger than 
30. Given arbitrary values of A^i, . . . , we can quickly check whether the 
result holds using a computer algebra system. 

5.2.2. Best column order for reflected Gray-code sorting 

For the reflected Gray-code order, we want to prove that the best column 
ordering is by increasing cardinality, irrespective of the column cardinalities. 
Blocks in reflected Gray-code sort are either ordered from 1 to 7V2, or from N2 to 
1. When two non-empty blocks of the same type are separated by empty blocks, 
the probability of having a seamless join is P\xN2- Otherwise, the probability of 



*We use the identity EL~o^(JV - 1 - k)x*' = 



Tilexico 1 t-tIcxIco 

-'-Ni,N2,PN3 + ^NiN2,N3,p 



This proves the result. □ 
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seamless join is P^n, = /I ^117/4)2^ ' = ^i-(C%N%2fCii^py) 

P€(0,1). 

There are A^i — 1 pairs of blocks immediately adjacent and with opposite 



orientations (e.g., from 1 to A''2 and then from iV2 to 1; see Fig. 9b), A^i — 2 pairs 
of blocks separated by a single block and having identical orientations, and so 
on. Hence, the expected number of seamless joins is P^N2P%2 Si^=o^ ^^^^ _ 

1 - 2fc)(l - pM^r + PliN.pl, EiL^o "'^^'^ (TVi - 2 - 2fc)(l - p^2Y^+\ 

We want a simpler formula for the number of runs, at the expense of intro- 
ducing an error of plus or minus one run. So consider the scenario where we have 
an infinitely long column, instead of just A^i blocks. However, we count only 
the number of seamless joins between a block in the first iVi blocks and a block 
following it. Clearly, there can be at most one extra seamless join, compared to 
the number of seamless joins within the iVi blocks. 

We have the formula xy^?°n(l — xY'^ = — rr — vr = tt^. Hence, this 

new number of seamless joins is S'^f^'j^f'^ = ^$Af2PAr2 Sfclo ~ Pn^Y^ + 

P\iN2PN2 2^k=0^l{^~ PN2) ^ " 2-p», + • 

T „f \roflcctcd _ -PtW2 +(1~PJV2 )Pll'^2 crcflcctcd „ \ reflected „ Ar 

Let A^^ - ; then ijVi^AT^ - Ajv^ PJVaJvi. 

Lemma 5. The function ^^^^ is a polynomial in x for all positive integers 

Proof. Polynomials of the form — 1 have exactly N distinct roots given 
by e^^*''/^ for fc = 0, 1, . . . , — 1. Hence, we can factor them as {x — l){x — 
g27ri/7V-| ... J- 2; _ ^2TTi(N-i)/Ny gy inspection, we have that all of the roots of the 
polynomial 1 — x^^ are also roots of the polynomial 1 — x'^^^'-' , hence the result. 
□ 

Lemma 6. If 2 < N2 < N^i < 30, then 

( -t \ reflectcd\ at- \ reflected ^ /i \ reflectcd\ 7\r \ reflected 

Proof. The proof is similar to the proof of Lemma |4j We want to show that: 

• For some value of p in (0,1), the result holds. 

• We can turn the inequality into a polynomial in p with no root in (0, 1). 

The first item follows by evaluating the derivative of both sides of the in- 
equality at p = 1. (Formally, our formula is defined for p G (0, 1), so we let the 
values and derivatives of our functions at 1 be implicitly defined as their limit 
as p tends to 1.) For all > 2 and at p = 1, we have that = 2, = 0, 

^ = 0, and ''^"^'J'"" = 2; moreover, we have pN = 1 and A5^*''='='°'^ = 1 at 

p ~ I. The derivatives of P^N,pN', PiiN,pN' and y^N^p^'^'^ are also zero for all 
A^, A^' > 2 at p = 1. Hence, the derivative of the left-hand-side of the inequality 
at p = 1 is —2N2 whereas the derivative of the right-hand-side is —2N^. Because 
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Ns > N2 and equality holds at p — 1, we have that the left-hand-side must be 
smaller than the right-hand-side at p = 1 — ^ for some sufficiently small f > 0. 

To conclude the proof, we have to show that the value (1 — A5^^°'^*'''^)p7V3^2 ~ 
^Cptf /^^=^p«3 - (1 - >^nT'''''')pn,N, + ^N!%fPN.,P.. is never zero for p G 
(0, 1). We multiply this quantity by {2~ pn2N3)pn2N3 and call the result T. We 
first show that T is a polynomial. 

Because PjjVa PATg ^^'^ PuNsPI;.^ are polynomials (respectively p^ + (l—p)^p'^+ 
• • • + (1 V and N2p'^{l -p)^'"^), we have that A5^^'='=*°'^ can be written 

as a polynomial divided by (2 - Pn3)Pn3- Hence, X'^^'"'^'"^pn3{2 - Pn2N3)PN2N3 
is a polynomial times (2-p"^)pw^"^ • turn, this fraction is "^^-^^^-^''^a^ 
which is a polynomial by Lemma [sj Hence, X^^^^'^^'^'^p^si^ — Pn2N3)pn2N3 is a 
polynomial. By symmetrical arguments, AJ^^^'^'^^^'^pAr^ (2 — PN2N3)pn2N3 is also a 
polynomial. 

Recall that Pn2,pn3 = PN2N3- We have that X'^^^'^p^^^ is a polynomial divided 
by {2- pn2N3)Pn2N3- Hence, it is immediate that X'^f'^^l''^pN2,pN3 multiplied by 
(2 — Pn2N3)pn2N3 is polynomial, merely by canceling the terms in the denomi- 
nator. A symmetrical argument applies to X'j^^'^p^^pN3,pN2- 

Hence, T is a polynomial. As in Lemma |4j for any given A'2 and A3, we can 
check that there are no roots by applying Sturm's method to the polynomial 
over the interval (0, 1]. Because there is a root at p = 1, it is sufficient to check 
that the total root count over (0, 1] is always one. (Alternatively, we could first 
divide the polynomial by x — 1 and check that there is no root.) We proved this 
result using a computer algebra system (see Appendix [b|) . This concludes the 
proof. □ 

Proposition 6. Consider a table with c independent and uniformly distributed 
columns having cardinalities Ni,N2, ■ ■ ■ ,Nc (let 2 < Ni < N^^i < 30 for i = 
1, . . . , c — 1). We can sort the table by reflected Gray-code order according to 
various column orders. The column order Ni, N2, . . . , Nc minimizes the number 
of column runs — up to a term no larger than c in absolute value. 

Proof. The proof is similar to Proposition [5| see Appendix [Cj □ 

6. EXPERIMENTS 

To complete the mathematical analysis, we ran experiments on realistic data 
sets. We are motivated by the following questions: 



For columns with few columns, is recursive sorting nearly optimal? (§ 6.3 1 



How significant can the effect of the column order be? Are refiected Gray- 



code orders better than lexicographical orders? (§ 6.4) 

How does an Hilbcrt order compare to lexicographical orders? (§ 6.5 ) 

How large is the effect of skew and column dependency? (§ |6.7|) 
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Table 2: Characteristics of data sets used 







rows 


distinct rows 


cols 




size 




Census-Income 




199 523 


178 867 


4 


102 609 


2.96 MB 


2.63 


Censusl881 


4 


277 807 


4 262 238 


7 


343 422 


305 MB 


5.09 


DBGEN 


13 


977 980 


11996 774 


4 


402 544 


297 MB 


1.02 


Netflix 


100 


480 507 


100 480 507 


4 


500 146 


2.61GB 


2.00 


KJV-4grams 


877 


020 839 


363 412 308 


4 


33 553 


21.6 GB 


2.19 



6.1. Software 

We implemented the various sorting techniques using Java and the Unix 
command sort. For all but lexicographic ordering, hexadecimal values were 
prcpended to each line in a preliminary pass over the data, before the com- 
mand sort was called. (This approach is recommended by Richards [42].) Be- 
side recursive orders, we also implemented sorting by Compact Hilbert Indexes 
(henceforth Hilbert) [Uj — also by prepending hexadecimal values. By default, 
we order values within columns alphabetically. 

6.2. Realistic data sets 

We used five data sets (see Table [2]) representative of tables found in applica- 
tions: Census-Income [55], Censusl881 'SB], DBGEN [57!, Netflix [58^ and KJV- 
4grams |28j . The Census-Income table has 4 columns: age, wage per hour, divi- 
dends from stocks and a numerical valuc[^ found in the 25'^ position of the orig- 
inal data set. The respective cardinalities are 91, 1240, 1478 and 99 800. The 
Censusl881 came from a publicly available SPSS file 1881_sept2008_SPSS.rar [55] 
that we converted to a flat file. In the process, we replaced the special values 
"ditto" and "do." by the repeated value, and we deleted all commas within 
values. The column cardinalities are 183, 2127, 2 795, 8 837, 24 278, 152 365, 
152 882. For DBGEN, we selected dimensions of cardinality 7, 11, 2 526 and 
400 000. The Netflix table has 4 dimensions: UserlD, MovielD, Date and Rat- 
ing, with cardinalities 480 189, 17 770, 2 182, and 5. Each of the four columns of 
KJV-4grams contains roughly 8 thousand distinct stemmed words: 8 246, 8 387, 
8 416, and 8 504. 

Table [2] also gives the suboptimality factor /i from Proposition [2] For DB- 
GEN, any recursive order minimizes the number of runs optimally — up to a 
factor of 1%. For Netflix and KJV-4-grams, recursive ordering is 2-optimal. 
Only for Censusl881 is the bound on optimality significantly weaker: in this 
instance, recursive ordering is 5-optimal. 

6.3. Recursive sorting is "safe" for low dimensionality 

Since our 7-dimensional data set yields a much looser bound than the 4- 
dimensional data sets, we investigate the relationship between and the number 



^The associated metadata says this column should be a 10-valued migration code. 
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p = 10 
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number of columns (r ) 



(b) Synthetic data sets 



Figure 10: Approximate /x versus columns, when sampling projections of re- 
alistic data sets and synthetic data sets (10-dimensional uniformly distributed 
table with Ni ^ N2 ^ ■ ■ ■ ^ ^ 10) 



of dimensions. Rather than use our arbitrarily chosen low-dimensional projec- 
tions, we randomly generated many projections (typically 1000) of each original 
data set, computed fi for each projection, then showed the /i values for each 
dimensionality (i.e., all fi values for 3-dimensional projections were averaged 
and reported; likewise all fi values for 4-dimensional projects were averaged 
and reported). One difficulty arose: computing /i for a projection required the 
number of distinct rows, and we projected from data sets that are at least a 
large fraction of our main-memory size. Gathering this data exactly appears 
too expensive. Instead, we computed the projection sizes in a few passes over 
our full data sets, using a probabilistic counting technique due to Cai et al. |59) 
that was shown by Aouiche and Lemire 6OJ to have a good performance. As an 
extra step, we corrected the distinct-row estimates so that they never exceeded 
the product of column cardinalities. To validate our estimates, we computed 
exact fj, values for two smaller data sets (TWEED [SU 1^2] with 52 dimensions 
and Ilk rows, and another with 13 dimensions and 581k rows) and observed 
our average fi estimates changed by less than 2%. KJV-4grams and Netflix only 
had 4 dimensions, and thus we used TWEED to get another high-dimensional 
data set. 

Figurc [TOa| shows that, after about 3 dimensions, /i grew roughly linearly with 
the number of dimensions: the fi formula's min(n, Ni i)/{n + c— 1) terms would 
typically approximate 1 for all but the first few dimensions. To illustrate this, 
we computed the expected value of fi for projections of uniformly distributed 
tables with various densities p (see Fig. lObl. 

This does not mean that any particular recursive sorting algorithm will be 
this far from optimal. Our fi is an upper bound on suboptimality, so it merely 
means that we have not given evidence that recursive sorting is necessarily good 
for higher-dimension data sets. For high-dimensional data sets, there could 
still be a significant advantage in going beyond lexicographic sorting or other 
recursive sorting approaches. However, for 2 or 3 dimensions, our fi values show 
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Table 3: RunCount after sorting various tables using different orderings. The 
up and down arrows indicate whether the columns where ordered in increasing 
or decreasing cardinality before sorting. Best results for each data set are in 
bold. 



table 


shuffled 


order 


lexico. 


Gray 


Hilbert 


Census-Income 


4.6x10^ 


; 
t 


3.2x10^ 
1.9 X 10^ 


3.2x10^ 
1.9 x 10^ 


3.4x10^ 
3.4x10^ 


Censusl881 


2.7x10^ 


t 


1.8x10^ 
1.3 X 10^ 


1.8x10^ 
1.3 x 10'^ 


2.0x10^ 
2.0x10^ 


DBGEN 


4.5x10^ 


t 


3.3x10^ 
1.2 X 10'^ 


3.3x10^ 
1.2 x 10'^ 


4.3x10^ 
4.3x10^ 


Netflix 


3.8x10^ 


t 


2.5x10^ 
1.2 X 10^ 


2.5x10* 
1.2 X 10* 


3.3x10* 
3.3x10* 


KJV-4grams 


3.4x10^ 


t 


3.9x10* 
3.9x10* 


3.8 x 10* 
3.8 X 10* 


8.2x10* 
8.2x10* 



that lexicographic sorting cannot be improved much. Of course, such projections 



may be nearly complete tables (cf. § 5.1 ) 



6.4- Column order matters, Gray codes do not 

Results for realistic data sets are given in Table [3] For these data sets, 
there were no noticeable benefits (within 1%) to Gray codes as opposed to 
lexicographic orders. The only data set showing some benefit (« 1%) is KJV- 
4grams. 

Relative to the shuffled case, ordering the columns in increasing cardinality 
reduced the number of runs by a factor of two (Census and Censusl881), three 
(DBGEN and Netfiix) or nine (KJV-4grams). Except for Netfiix and KJV- 
4grams, these gains drop to « 50% when using the wrong column order (by 
decreasing cardinality). On Netfiix, the difference between the two column 
orders is a factor of two (2.5x10* versus 1.2x10*). 

The data set KJV-4grams appears oblivious to column reordering. We are 
not surprised given that columns have similar cardinalities and distributions. 



6.5. Compact Hilbert Indexes are not competitive 

Hilbert is effective at improving the compression of database tables [551 [M] 
using tuple difference coding techniques [BS]. Moreover, for complete tables 
where the cardinality of all columns is the same power of two, sorting by Hilbert 
minimizes the number of runs (being a Gray code §|4|. However, we are unaware 
of any application of Hilbert to column-oriented indexes. 

To test Hilbert, we generated a small random table (see Table|4]) with moder- 
ately low density {p = 0.01). The RunCount result is far worse than recursive 
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Table 4: Comparison of Compact Hilbert Indexes with other orderings for a 
uniformly distributed table {p = 0.01, c = 5) and various column cardinalities. 
The number of runs is given in thousands. 



cardinalities 


shuffled 


lexico. 


reflected Gray 


modular Gray 


Hilbert 


4,8,16,32,64 


47.8 


18.9 


18.7 


18.7 


35.3 


64,32,16,8,4 


47.8 


28.5 


28.1 


28.2 


35.3 


16,16,16,16,16 


49.7 


23.7 


23.3 


23.4 


35.3 



ordering, even when all column cardinalities are the same power of 2. In this 
test, Hilbert is column-order oblivious. We have similar results over realistic 
data sets (see Table |3]). In some instances, Hilbert is nearly as bad as a random 
shuffle of the table, and always inferior to a mere lexicographic sort. For KJV- 
4grams, Hilbert is relatively effective — reducing the number of runs by a factor 
of 4 — but it is still half as effective as lexicographically sorting the data. 

6. 6. The order of values is irrelevant 

For several recursive orders (lexicographic and Gray codes), we reordered 
the attribute values by their frequency — putting the most frequent values first. 
While the number of runs in sorted uniformly distributed tables is oblivious 
to the order of attribute values, we may see some benefits with tables having 
skewed distributions. However, on the realistic data sets, the differences were 
small — less than 1% on all metrics for recursive ordering. 

For Hilbert, reordering the attribute values had small and inconsistent ef- 
fects. For Census, the number of runs went up from 3.4x10^ to 3.6x10^ (-1-6%), 
whereas for Netflix, it went down from 3.3x10^ to 3.2x10* (-3%). The strongest 
effect was observed with KJV-4grams where the number of runs went down from 
8.2x 10* to 7.6x 10* (-7%). These differences are never sufficient to make Hilbert 
competitive. 

6. 7. Skew and column dependencies reduce the number of runs 

We can compute the expected number of runs for uniformly distributed ta- 
bles sorted lexicographically by the proof of Proposition [5] For Census-Income, 
we compared this value with the number of runs for all possible column orders 



(see Fig. 11). Distribution skew and dependencies between columns make a 
substantial difference: the number of runs would be twice as high were Census- 
Income uniformly distributed with independent columns. 



CONCLUSION 

Unsurprisingly, an effective heuristic to minimize the number of runs and 
column-oriented index sizes is to sort lexicographically after reordering the 
columns in increasing cardinality. This heuristic is often recommended |13l 128] . 
However, our results stress the importance of reordering the columns. Picking 
the wrong column order can result in a moderate reduction of the number of 
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Figure 11: Number of runs for the Census-Income data set, including the ex- 
pected number assuming that the table is uniformly distributed. The columns 
are indexed from 1 to 4 in increasing cardinality. Hence, the label 1234 means 
that the columns are in increasing cardinality. 

700000 

650000 

600000 
« 550000 
5 500000 
o 450000 
S 400000 
E 350000 
Z 300000 

250000 

200000 

150000 

^co^cmcoc\i^co^-.-co-.-^cm^-.-cmt-coc\ico-^c\i-^ 
co^tM^CMcoco^-.-^-^coog^-^^-^CMcgcOT-co-'-CM 

C\lC\ICOC0^^-^-^C0CO^^-.--^CM(M^^-.--.-C\lCMC0C0 
■^-.-■.-■^■^■.-CMCMCMCMCMCMCOCOCOCOCOCO^^^^^^ 

runs (50%) whereas a large reduction is possible (2-3 x) when using the right 
column order. 

The benefit of recursive Gray-code orders over lexicographic orders is small. 
Sorting the values within columns has also small effects (< 1%) for several 
recursive orders. 

FUTURE WORK 

The first step toward the estimation of the size of column indexes under 
sorting is to assume that columns are statistically independent. However, it 
might possible to lift this assumption by modeling the dependency between 
columns [66 1 167 ] . 
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A. Table of Notation 



Notation explanation 
n 



defined used in 



M 

PNi,p' 



ilN2 



IIN2.P' 



runs in column i 
number of columns 
number of rows 
cardinality of column i 



P- 
P.p 
p. H 

p. 5 
p. 6 



recursive sorting is /z-optimal 
for the run minimization prob- P 
lem 

probability that a block of 
Ni tuples is nonempty 

same except individual tuples 
present with probability p' 
rather than default p 

with lexicographic sorting, 
probability that two nonempty 
blocks in column 2 seamlessly 
join 

same except individual tuples 
present with probability p' 



with reflected Gray sorting, 
probability that two nonempty 
blocks in column 2 seamlessly 
join 

same except individual tuples 
present with probability p' 



10 



15 



17 



20 



1 

throughout 
throughout 
throughout 
throughout 

throughout 



5.2 



5.2 



5.2 



5.2 



5.2 



5.2 



B. Maxima Computer Algebra System code 

For completing some of the proofs, we used Maxima version 5.12.0 [68] . 
Scripts ran during about 49 hours on a Mac Pro with two double-core Intel 
Xeon processors (2.66 GHz) and 2GiB of RAM. 

The proof of Lemma [4] uses the following code which ran for 185 minutes: 



32 



r(N2,p) :=l-(l-p)**N2; 

Pdd(N2,p) :=N2*p**2*(l-r(N2,p))/((l-p)*r(N2,p)**2) ; 
P: (l-Pdd(N3,p))*r(N3,p)*N2- (l-Pdd(N2,p))*r(N2,p)*N3 
-Pdd(N2,r(N3,p))*r(N2*N3,p)+Pdd(N3,r(N2,p))*r(N2*N3,p) ; 
P2:P*r(N2*N3,p) ; 

for n2:2 unless n2>30 do 
(display (ii2) , 

for ii3:n2+l unless n3>100 do 
( nr: nroots (factor (subst ( [N2=n2 ,N3=n3] ,P2) ), 0, 1) , 
if (not(nr=0)) then displayC'ERROR" ,n2,n3,nr))) ; 

The proof of Lemma |6] uses this code which ran for 46 hours: 

r(N2,p) :=l-(l-p)**N2; 

Pdd(N2,p) :=N2*p**2*(l-r(N2,p))/((l-p)*r(N2,p)**2) ; 
Pud(N2,p) :=p**2*(2-r(N2,p))/(r(N2,p)*(l-(l-p)**2)) ; 
Lainbda(N2,p) :=(Pud(N2,p)+(l-r(N2,p))*Pdd(N2,p))/(2-r(N2,p)) ; 
P: (l-Lainbda(N3,p))*r(N3,p)*N2- (l-Lainbda(N2,p))*r(N2,p)*N3 
-Lambda (N2 , r (N3 , p) ) *r (N2*N3 , p) +Lambda (N3 , r (N2 , p) ) *r (N2*N3 , p) ; 
P2:P*(2-r(N2*N3,p))*r(N2*N3,p) ; 

for n2:2 unless n2>30 do 
(display (n2) , 

for n3:n2+l unless n3>100 do 
( nr: nroots (factor (subst ( [N2=n2 ,N3=n3] ,P2) ), 0, 1) , 
if(not(nr=l)) then displayC'ERROR" ,n2,n3,nr))) ; 



C. Proof of Proposition [6] 

Proof. Define Tj^f "^^^^^^^ = N.N^Pn, - 3^^^%^^^ where S^^^f:^, 
defined as S'§^%^^ after substituting pN, for p. Define Tj,f§l%^p, 
^^t!w3?p„2 and T^f7^3*5^,,p similarly. As in the proof of Proposition [sj it is suffi- 
cient to prove that +nfNtt.P ^ 5^3?p«, +^^fi^f^.,P whenever 
-^2 ^ irrespective of the value of A^i (allowing iVi > N^). 
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We have 



Tireflected , rrircflcctcd AT AT ^ \ reflected^ at 

Nl,N2,PN3 ^ -'-NiN2,N3,p — ^^1^^2PN3 ~ ^N2,PN3 PN2.PN3^^1 



+N1N2N3P - A5^«<=^*^VjV3A^iiV2 
= (1 - A5^«°^'''^)pjV3iViiV2 

\ reflected at 

~^N2,PN3 PN2,PN3^^1 
+ N1N2N3P 

< (1 - A5^f ^"='^)pjv,iViiV3 

\ reflected „ at- 

-^N3,PN2 PN3,PN2-'^1 

+N1N2N3P + A5^f (by Lemma § 
+N,N2N3p - AJ^f ="=Vjv.A^iiV3 

T-ireflected 1 Tireflected 

— '^Ni,N3,pN2~^ ^NiN3,N2,p- 



This proves the result. □ 



D. A Related NP- Completeness Result 

In § |4.2| we showed it is NP-hard to order columns so as to minimize the 
RunCount value after lexicographic sorting. We now show a related problem 
is NP-complete. 

Column- Ordering-for-Minimax Lexicographic Runcount ( COMLR ). Given a ta- 
ble T, an ordering on the values found in each column, and an integer K, is it 
possible to reorder the columns of the table, such that when the reordered table 
is lexicographically sorted, no column has more than K runs? 

Proposition 7. COMLR is NP-complete. 

Proof. Membership in NP is obvious. We show COMLR is NP-hard by reduc- 
tion from 3SAT [351 L02]. Suppose our 3SAT instance has variables vi to v\y\ 
and clauses Ci to Cm- We assume that no clause contains both a variable and its 
negation because such a clause can be removed without affecting satisfiability. 

For every variable w^, the COMLR instance has three values that can appear 
in tables: Wi, Wi and 0^,^. They are ordered: Wi < Wi < Ou,. . Moreover, for 
a G {wijWijOwi}, b G {wj,Wj,Owj} and i ^ j, we have a < 6 if and only if i < j. 

Two other values are used in the table, -f oo and — oo whose orderings with 
respect to the other values are as expected. 

We construct a table T, with 3\V\ -\- 2 rows, and with a column for each 
possible literal and a column for each clause. Hence T has 2\V\ -\- m columns. 
We describe the columns from left to right, beginning with the columns for vi 
and vi. See Fig.[l2| 
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Consider the literal column associated with vi . It begins with a run of length 
3x1 — 2 with the —oo value. It then contains wi, wi, 0^^ . The remainder of 
the column is composed of +00. The next column is for vi. It begins and ends 
similarly, but in the middle it has Wi , 0^,^ , Wi . The pairs of columns for the 
remaining variables then follow. The column for Vi begins with a run containing 
3i — 2 copies of the —00 value, then has v)i,WiOwi, whereas the column for Vi 
has Wi, , Wi between the run of —00 and the run of +00. Thus, the left part 
of the table has blocks of size 3x2 arranged diagonally . Above the diagonal, 
we have —00; below the diagonal, we have +00. (Except that there is a row of 
—00 above everything and a row of +00 below everything.) 

To complete the construction, wc have one column per clause. Consider a 
clause {li,lj,lk} where h — Vi or li — Vi and similarly for Ij and Ik- Each 
column begins with —00 and ends with +00. Otherwise, the column copies the 
column for li within the zone of Vi, where the zone of variable Vi consists of 
rows 3z — 2,3i — l,3i in the table. The construction is such that no matter 
how columns are reordered, a lexicographic sort can rearrange rows only within 
their zones. Similarly, the column copies the columns for Ij and Ik within the 
zones of Vj and Vk, respectively. Otherwise, the part of the column that is in the 



zone of wi {I ^ {i,j,k}), contains 0^,, . See Fig. 12 for the table constructed for 
{{vi,V2,V3}, {vi,V2, V3},{vi,V3,Vi},{vi,V3,V4}}. Finally, we set the maximum- 
runs-per-column bound K — \ V\ +7. 

The construction creates literal columns that cannot have many runs no 
matter how we reorder columns and lexicographically sort the rows. Conse- 
quently these columns always meet the |y| + 7 bound. For clause columns: 
after any column permutation and lexicographic sorting, a clause column can 
have at most + 8 nms: 

• 2 for the —00 and the +00, 

• {\V\ — 3) for the variables that are not in the clause, 

• and at most 3 for each of the 3 variables that are in the clause. 

Table T can have its columns reordered to have at most |y| + 7 runs per 
column (after lexicographic sorting), if and only if the given instance of 3SAT 
is satisfiable. 

Suppose we have a satisfying truth assignment. If Vi is true, permute the 
columns for Vi and Uj. (Otherwise, leave them alone.) After permuting these 
columns, lexicographic sorting would swap the bottom two rows in the zone for 
Vi . Any clause containing Vi would find that this swap merges two runs of Wi in 
its column, and thus we would meet the \V\ + 7 bound for that clause's column. 
Likewise, if Vi is false, leave the two columns in their original relationship. The 
table as constructed was lexicographically sorted, and any clause containing Vi 
would continue to have a run of uii's and meet the run bound. Since we have a 
satisfying truth assignment, every clause column will contain at least one such 
run. 

Conversely, suppose we have permuted table columns such that the lexico- 
graphically sorted table has no column with more than |y| 4- 7 runs. Because 
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Figure 12: Example construction for {01,02,03,04}, where Oi = {vi,V2,V3}, 

02 = {V1,V2,V3}, 03 = {vi,V3,V4}, and 04 = {vi,V3,V4}. 

lexicographic sorting is restricted to rearranging rows only within their zones, a 
clause's column must contain a length- two run of Wi or iVi, for some 1 < « < 
The construction guarantees that if any clause column contains a length-two run 
of Wi, then no column contains a length- two run of Wi. Similarly, a length- two 
run of Wi precludes a length- two run of Wi. Moreover, by construction we see 
that a column containing the length- two run of Wi must contain Vi. Hence, we 
set Vi to true. Likewise, for any run of u), we set Vi to false. Clearly, this truth 
setting satisfies the original 3SAT instance. □ 
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